Abstract —Model order reduction (MOR) techniques play a
crucial role in the computer-aided design of modern integrated
circuits, where they are used to reduce the size of parasitic
networks. Unfortunately, the efficient reduction of passive networks
with many ports is still an open problem. Existing techniques do
not scale well with the number of ports, and lead to dense reduced
models that burden subsequent simulations. In this paper, we
propose TurboMOR, a novel MOR technique for the efficient
reduction of passive RC networks. TurboMOR is based on moment
matching, achieved through efficient congruence transformations
based on Householder reflections. A novel feature of TurboMOR
is the block-diagonal structure of the reduced models, which makes
them more efficient than the dense models produced by existing
techniques. Moreover, the model structure allows for an insightful
interpretation of the reduction process in terms of system theory.
Numerical results show that TurboMOR scales more favourably
than existing techniques in terms of reduction time, simulation
time and memory consumption.

Index Terms —Model order reduction, many ports, moment 
matching, parasitics, partitioning. 

I. Introduction 

While designing VLSI chips, engineers need to take
into account the parasitic resistance, capacitance and 
inductance of signal- and power-delivery interconnects, in 
order to prevent signal and power integrity issues Q-@- 
Electromagnetic solvers are used to extract RC or RLC
interconnect models, which are then connected to non-linear
devices for system-level simulations. Unfortunately, parasitic 
networks can be very large, featuring a huge number of 
components, nodes and ports. Direct simulation involving such 
large networks is often prohibitive. Model order reduction 
(MOR) is frequently used to reduce parasitic models to a 
manageable size, and accelerate subsequent simulations. 

Several approaches to MOR have been proposed in the last 
decades, such as node elimination Q, Krylov subspaces Q, 
1^, and balancing Q. Krylov methods are widely used for 
parasitic reduction, since they are more scalable than balancing 
methods. Among them, PRIMA Q is one of the most popular 
and widely used Krylov algorithms. PRIMA's success is due
to its ability to guarantee the passivity of the reduced order
model (ROM), a mandatory property to prevent divergent
transient simulations Q.
Unfortunately, PRIMA can become very inefficient when
applied to networks with many ports. PRIMA generates the 
reduced model through a congruence transformation with an 
orthogonal matrix that spans a suitable Krylov subspace. The 
orthogonal projection matrix is dense and can become very 
large when ports are many. Generating the ROM becomes 
very time consuming, since it involves products between large 
and dense matrices. In some cases, even storing the projection 
matrix can be challenging. Moreover, the obtained reduced 
model is dense, large and frequently slower than the original 
system. These limitations affect most existing techniques and
remain an open problem in MOR [10].

A number of techniques have been recently proposed to
address such challenges. Methods like SVDMOR [11],
ESVDMOR [12], RECMOR [13] and several others [14], [15] aim
at reducing the number of ports before applying PRIMA. This
is done by exploiting the correlation that may exist between
different ports. However, practical networks with many ports
rarely exhibit a high degree of correlation [16].

In [17]-[19], the problem of reducing networks with many
ports is simplified by clustering inputs into small groups, and 
reducing each subsystem individually. These methods generate 
accurate and block diagonal ROMs that are sparse. However, 
since subsystems are treated independently, passivity is not 
always guaranteed. 

Another method known as SIP [20] offers a more efficient
approach to moment matching for RC networks. Rather than
explicitly constructing the projection matrix, sparse matrix
manipulations are used to generate the reduced matrices directly
using the Schur complement, an idea also used in PACT [21].
This makes SIP more efficient than PRIMA for large networks
with many ports. However, SIP can match only two moments
per expansion point. This level of accuracy is not always
sufficient for practical applications [20], as we will show in
Sec. IV-B. The authors in [20] suggest using multi-point moment
matching 0, 0,0 to achieve more accuracy. However, the 
obtained reduced matrices can be singular, and avoiding this 
issue does not seem to be trivial. 

In [23], the SparseRC method is proposed, combining
graph-partitioning techniques [24] with a SIP-like reduction
process. A divide and conquer strategy is used to partition
the original system into smaller subsystems, which are then
reduced separately with a method similar to SIP [20]. The resulting
ROM has the same partitioned structure as the original system.
Such a reduction strategy is efficient in terms of memory and
CPU time for large networks, since the problem of reducing
the large system simplifies to reducing smaller subsystems
that can be managed efficiently. The generated ROM is also
sparse. SparseRC, however, like SIP, is limited to matching
two moments per expansion point. While PRIMA can be used
to match additional moments, as suggested in pT) , this reduces 
efficiency, because of the limitations of PRIMA discussed 
previously. 

In this paper, we propose TurboMOR, a novel MOR technique
for RC networks with many ports. TurboMOR achieves
moment matching without explicitly computing a dense
projection matrix as in PRIMA. Efficient and memory-conscious
Householder reflections [25] are used to generate the reduced
model, and match two moments per iteration. Unlike
previous methods such as SIP [20], an arbitrary number of
moments can be matched, providing full control over accuracy.
TurboMOR can be combined with partitioning [23], [24] to
reduce very large networks. A key feature of TurboMOR is the
block-diagonal structure of the reduced models, which addresses
the poor efficiency of the dense models produced by existing
moment-matching techniques. The block diagonal structure
also lends itself to a novel and insightful interpretation of
moment matching in terms of cascaded subsystems. The
reduced models produced by TurboMOR are passive, retain
the input-output structure of the original system, and can be
synthesized into an equivalent RC netlist [26]. Numerical tests
demonstrate the superior scalability of TurboMOR in terms
of reduction time, simulation time, and memory consumption.

The rest of the paper is organized as follows. In Sec. II
we state the problem and briefly review the foundations
of moment matching. In Sec. III we discuss the theoretical
derivation and practical implementation of TurboMOR.
Sec. IV compares TurboMOR against the state of the art. In
Sec. V we draw our conclusions, and in the Appendix we
provide some mathematical proofs.


II. Problem Formulation 
We consider a passive network made of resistors and
capacitors with $m$ nodes and $p$ ports. Using nodal analysis,
the network can be described in the Laplace domain by the
system of equations

$$\begin{aligned} G x(s) + s C x(s) &= B u(s) \\ y(s) &= B^T x(s) \end{aligned} \tag{1}$$

where vectors $u(s) \in \mathbb{R}^{p}$ and $y(s) \in \mathbb{R}^{p}$ collect all port
currents and port voltages, respectively. Vector $x(s) \in \mathbb{R}^{m}$
contains all nodal voltages. Matrices $G, C \in \mathbb{R}^{m \times m}$ are the
conductance and capacitance matrices, respectively. They are
symmetric and non-negative definite. Matrix $B \in \mathbb{R}^{m \times p}$ maps
input ports to the nodal equations, and $^T$ denotes transposition.
The transfer function of (1) reads

$$H(s) = B^T (G + sC)^{-1} B \tag{2}$$

The goal of MOR is to approximate (1) with a model of much
lower order $n \ll m$

$$\begin{aligned} \tilde{G} \tilde{x}(s) + s \tilde{C} \tilde{x}(s) &= \tilde{B} u(s) \\ y(s) &= \tilde{B}^T \tilde{x}(s) \end{aligned} \tag{3}$$

where $\tilde{G}, \tilde{C} \in \mathbb{R}^{n \times n}$, $\tilde{B} \in \mathbb{R}^{n \times p}$ and $\tilde{x}(s) \in \mathbb{R}^{n}$. This model
must accurately capture the response of the original system
across the frequency range of interest.

One way of ensuring accuracy is through Padé approximation,
also known as moment matching. Around $s = 0$, the
Taylor series expansion of (2) reads

$$H(s) = M_0 + M_1 s + M_2 s^2 + \dots \tag{4}$$

The coefficients $M_k$ are called moments of (2) at DC 0. 0.
Q, and can be related to the system matrices as

$$M_k = B^T \left(-G^{-1} C\right)^k G^{-1} B \qquad \forall k = 0, 1, 2, \dots \tag{5}$$

The moments $\tilde{M}_k$ of the reduced model are defined similarly, as
the Taylor expansion coefficients of the transfer function

$$\tilde{H}(s) = \tilde{B}^T (\tilde{G} + s\tilde{C})^{-1} \tilde{B} \tag{6}$$

of the reduced model (3).

The goal of moment matching is to generate a ROM (3)
that will match the first moments of the original system

$$\tilde{M}_k = M_k \qquad \forall k = 0, \dots, 2q - 1 \tag{7}$$

up to a given order controlled by $q$. Since, for RC networks,
moments are typically matched in pairs, we denote the number
of matched moments as $2q$. By increasing $q$ the ROM will
become more accurate, but also larger.
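The moment definition (5) can be checked numerically. The following is a minimal sketch, assuming NumPy and a small randomly generated test system (the helper `random_rc` and its sizes are hypothetical and are not taken from the paper's benchmarks). It evaluates the moments through the recursion implied by (5) and compares the truncated Taylor series (4) with the transfer function (2) at a low frequency.

```python
import numpy as np

def random_rc(m, p, seed=0):
    """Hypothetical test system: random symmetric positive definite G and C,
    with the p ports attached to the first p nodes."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((m, m))
    Y = rng.standard_normal((m, m))
    G = X @ X.T + m * np.eye(m)
    C = Y @ Y.T + m * np.eye(m)
    B = np.vstack([np.eye(p), np.zeros((m - p, p))])
    return G, C, B

def moments(G, C, B, count):
    """Return [M_0, ..., M_{count-1}], M_k = B^T (-G^{-1} C)^k G^{-1} B, as in (5)."""
    V = np.linalg.solve(G, B)            # G^{-1} B
    out = []
    for _ in range(count):
        out.append(B.T @ V)
        V = -np.linalg.solve(G, C @ V)   # multiply by -G^{-1} C
    return out

def transfer(G, C, B, s):
    """Transfer function H(s) = B^T (G + sC)^{-1} B, as in (2)."""
    return B.T @ np.linalg.solve(G + s * C, B)

G, C, B = random_rc(m=12, p=2)
M = moments(G, C, B, 4)
s = 1e-3
taylor = sum(Mk * s**k for k, Mk in enumerate(M))
err = np.abs(taylor - transfer(G, C, B, s)).max()
```

Here `err` measures how well four moments reproduce the response at `s = 1e-3`; the same `moments` routine can be applied to a reduced model to test condition (7).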

In PRIMA, moment matching is performed with a congruence
transformation applied to the matrices of the original
system (1)

$$\tilde{G} = Q^T G Q, \qquad \tilde{C} = Q^T C Q, \qquad \tilde{B} = Q^T B \tag{8}$$

The columns of $Q \in \mathbb{R}^{m \times qp}$ span the Krylov subspace

$$\mathcal{K}_q(A, R) = \mathrm{span}\{R, AR, A^2 R, \dots, A^{q-1} R\} \tag{9}$$

where $A = -G^{-1} C$ and $R = G^{-1} B$. It can be shown
that ROM (3) matches the first $2q$ moments of the original
system. The reduced model is of size $n = qp$, and is passive
by construction since congruence transformation (8) maintains
the non-negative nature of $G$ and $C$. The projection matrix $Q$
is constructed numerically with the block Arnoldi process 0,
an orthogonalization procedure similar to the modified
Gram-Schmidt process [25]. Unfortunately, orthogonalization leads
to a dense $Q$. As a result, when $p$ is high, computing $Q$ and
projection products (8) can be very expensive. For very large
networks, even storing $Q$ becomes an issue, since its size can
easily exceed several gigabytes. Moreover, transformations (8)
lead to a dense ROM, which will burden any subsequent circuit
simulation. These bottlenecks, which make existing methods
quite inefficient for many-port networks, are tackled by the
proposed method.
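To make the cost argument concrete, the projection (8)-(9) can be sketched in a few lines. This is an illustrative dense implementation, assuming NumPy and a hypothetical random test system; it uses a simple block Gram-Schmidt loop rather than a production block Arnoldi code, and it is not the reference PRIMA implementation. It shows that $Q$ has $qp$ dense columns and that the resulting ROM matches $2q$ moments.

```python
import numpy as np

def moments(G, C, B, count):
    """M_k = B^T (-G^{-1} C)^k G^{-1} B for k = 0..count-1, as in (5)."""
    V = np.linalg.solve(G, B)
    out = []
    for _ in range(count):
        out.append(B.T @ V)
        V = -np.linalg.solve(G, C @ V)
    return out

def prima(G, C, B, q):
    """Orthonormal basis Q of K_q(A, R) with A = -G^{-1} C, R = G^{-1} B,
    followed by the congruence transformation (8)."""
    blocks = []
    W = np.linalg.solve(G, B)
    for _ in range(q):
        for V in blocks:                  # orthogonalize against earlier blocks
            W = W - V @ (V.T @ W)
        V, _ = np.linalg.qr(W)            # orthonormalize the new block
        blocks.append(V)
        W = -np.linalg.solve(G, C @ V)    # next Krylov block
    Q = np.hstack(blocks)                 # dense, size m x (q p)
    return Q.T @ G @ Q, Q.T @ C @ Q, Q.T @ B, Q

# Hypothetical test system: random SPD matrices, ports on the first p nodes.
rng = np.random.default_rng(1)
m, p, q = 30, 3, 2
X = rng.standard_normal((m, m))
Y = rng.standard_normal((m, m))
G = X @ X.T + m * np.eye(m)
C = Y @ Y.T + m * np.eye(m)
B = np.vstack([np.eye(p), np.zeros((m - p, p))])

Gr, Cr, Br, Q = prima(G, C, B, q)
full = moments(G, C, B, 2 * q)
red = moments(Gr, Cr, Br, 2 * q)
```

In this sketch both `Q` and the reduced matrices `Gr`, `Cr` are fully populated, which is the source of the memory and simulation costs discussed above.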


III. Proposed Method 

In this section, we discuss the theoretical derivation of 
TurboMOR and how it can be implemented for maximum effi¬ 
ciency. The method works recursively, matching two moments 
per iteration. We discuss the first two iterations in detail, before 
generalizing. 
Fig. 1. System theory interpretation of (10a) and (10b). The original system
has been decomposed into two subsystems $\Sigma_1^{(1)}$ and $\Sigma_2^{(1)}$, decoupled at DC.


A. Theoretical Derivation 


1) Matching Two Moments: The first iteration of the
proposed method is analogous to [20], [21], [23]. Nodes are first
reordered in such a way that port nodes come first, followed
by internal nodes. After reordering, system (1) reads

$$\left( \begin{bmatrix} G_{11} & * \\ G_{21} & G_{22} \end{bmatrix} + s \begin{bmatrix} C_{11} & * \\ C_{21} & C_{22} \end{bmatrix} \right) \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} B_1 \\ 0 \end{bmatrix} u \tag{10a}$$

$$y = \begin{bmatrix} B_1^T & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \tag{10b}$$

where $x_1 \in \mathbb{R}^{p}$ and $x_2 \in \mathbb{R}^{m-p}$ denote port and internal
node voltages, respectively. The symbol $*$ is used in symmetric
matrices to denote the transpose of the symmetric block across
the diagonal. For the purpose of shortening our notation, we
do not indicate explicitly the dependency on $s$ for input, output
and state variables. Submatrix $G_{21}$ describes the resistive
couplings present between internal and port nodes. We eliminate
this block through Gaussian elimination, using the congruence
transformation (8) with $Q$ given by


$$Q = \begin{bmatrix} I_p & 0 \\ -K^{-T} K^{-1} G_{21} & I_{m-p} \end{bmatrix} \tag{11}$$

Matrix $K$ is the Cholesky factor [25] of $G_{22}$, so that
$G_{22} = K K^T$. For the time being, we assume $G_{22}$ to be
strictly positive definite. In Sec. III-C we will discuss how a
singular $G_{22}$ can be handled. Matrix $I_p$ is the identity matrix
of size $p \times p$. After the congruence, equations (10a) and (10b) become


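$$\left( \begin{bmatrix} G_{11}^{(1)} & 0 \\ 0 & G_{22} \end{bmatrix} + s \begin{bmatrix} C_{11}^{(1)} & * \\ C_{21}^{(1)} & C_{22} \end{bmatrix} \right) \begin{bmatrix} x_1 \\ x_2^{(1)} \end{bmatrix} = \begin{bmatrix} B_1 \\ 0 \end{bmatrix} u \tag{12a}$$

$$y = \begin{bmatrix} B_1^T & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2^{(1)} \end{bmatrix} \tag{12b}$$

where

$$G_{11}^{(1)} = G_{11} - G_{21}^T K^{-T} K^{-1} G_{21} \tag{13}$$

$$C_{11}^{(1)} = C_{11} - G_{21}^T K^{-T} K^{-1} C_{21} - C_{21}^T K^{-T} K^{-1} G_{21} + G_{21}^T K^{-T} K^{-1} C_{22} K^{-T} K^{-1} G_{21} \tag{14}$$

$$C_{21}^{(1)} = C_{21} - C_{22} K^{-T} K^{-1} G_{21} \tag{15}$$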


With Gaussian elimination, all resistive couplings between port 
nodes and internal nodes have been eliminated, leaving only 
capacitive couplings. 


The obtained equations lend themselves to a useful
interpretation in terms of system theory, depicted in Fig. 1.
System (12a)-(12b) can be seen as the cascade of a system
of order $p$

$$\Sigma_1^{(1)}: \begin{cases} G_{11}^{(1)} x_1 + s C_{11}^{(1)} x_1 = B_1 u + u_1^{(1)} \\ y = B_1^T x_1 \end{cases} \tag{16}$$

and a system of order $m - p$

$$\Sigma_2^{(1)}: \begin{cases} G_{22} x_2^{(1)} + s C_{22} x_2^{(1)} = -C_{21}^{(1)} u_2^{(1)} \\ y_2^{(1)} = -\left(C_{21}^{(1)}\right)^T x_2^{(1)} \end{cases} \tag{17}$$

Only the first subsystem is directly connected to the
input/output ports of the network. Subsystem $\Sigma_2^{(1)}$ is instead
connected only to $\Sigma_1^{(1)}$, through equations $u_1^{(1)} = s y_2^{(1)}$ and
$u_2^{(1)} = s x_1$, which define time derivatives. The coupling
between the two subsystems is thus purely dynamical. At DC,
the second system is completely decoupled from $\Sigma_1^{(1)}$ and the
network ports, and has no influence on the transfer function
$H(s)$ between input $u$ and output $y$. At low frequency, the
coupling between the two is weak, and the overall system
response is given mainly by $\Sigma_1^{(1)}$. Therefore, the first subsystem
alone can be interpreted as a ROM of order $p$ of the original
system


$$\begin{aligned} G_{11}^{(1)} x_1 + s C_{11}^{(1)} x_1 &= B_1 u \\ y &= B_1^T x_1 \end{aligned} \tag{18}$$

In the Appendix, we indeed prove that (18) matches the first
two moments of the original system at $s = 0$. From an
accuracy standpoint, the proposed ROM is thus equivalent in
size and accuracy to the ROMs generated by other moment
matching techniques. Its computation, however, requires less
effort, since its matrices $G_{11}^{(1)}$ and $C_{11}^{(1)}$ can be computed
cheaply using sparse matrix techniques.
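The first iteration can be illustrated with a short dense sketch of (13)-(15), assuming NumPy, a hypothetical random test system with the ports on the first $p$ nodes, and $B_1 = I_p$. A practical implementation would use sparse Cholesky routines instead of the dense calls used here.

```python
import numpy as np

def moments(G, C, B, count):
    """M_k = B^T (-G^{-1} C)^k G^{-1} B for k = 0..count-1, as in (5)."""
    V = np.linalg.solve(G, B)
    out = []
    for _ in range(count):
        out.append(B.T @ V)
        V = -np.linalg.solve(G, C @ V)
    return out

def first_iteration(G, C, p):
    """Eliminate G21 by block Gaussian elimination; returns the blocks (13)-(15)."""
    G11, G21, G22 = G[:p, :p], G[p:, :p], G[p:, p:]
    C11, C21, C22 = C[:p, :p], C[p:, :p], C[p:, p:]
    K = np.linalg.cholesky(G22)                          # G22 = K K^T
    X = np.linalg.solve(K.T, np.linalg.solve(K, G21))    # K^{-T} K^{-1} G21
    G11r = G11 - G21.T @ X                               # (13)
    C11r = C11 - X.T @ C21 - C21.T @ X + X.T @ C22 @ X   # (14)
    C21r = C21 - C22 @ X                                 # (15)
    return G11r, C11r, C21r

# Hypothetical test system: random SPD matrices, ports on the first p nodes.
rng = np.random.default_rng(2)
m, p = 20, 3
X0 = rng.standard_normal((m, m))
Y0 = rng.standard_normal((m, m))
G = X0 @ X0.T + m * np.eye(m)
C = Y0 @ Y0.T + m * np.eye(m)
B = np.vstack([np.eye(p), np.zeros((m - p, p))])

G11r, C11r, C21r = first_iteration(G, C, p)
full = moments(G, C, B, 2)
rom = moments(G11r, C11r, np.eye(p), 2)   # ROM (18) with B1 = I_p
```

The lists `full` and `rom` hold the first two moments of the original system and of the order-$p$ model (18), so they can be compared directly.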

2) Matching Four Moments: In order to match more than
two moments, the presence of $\Sigma_2^{(1)}$ must be taken into account.
Instead of applying PRIMA to $\Sigma_2^{(1)}$ [23], losing
efficiency, we show how additional moments can be efficiently
matched by further decomposing $\Sigma_2^{(1)}$.

First, we apply a congruence transformation to (17) using
$Q = K^{-T}$ in (8)

$$\begin{aligned} I_{m-p}\, \hat{x}_2^{(1)} + s K^{-1} C_{22} K^{-T} \hat{x}_2^{(1)} &= -K^{-1} C_{21}^{(1)} u_2^{(1)} \\ y_2^{(1)} &= -\left(C_{21}^{(1)}\right)^T K^{-T} \hat{x}_2^{(1)} \end{aligned} \tag{19}$$

where $x_2^{(1)} = K^{-T} \hat{x}_2^{(1)}$. This step turns $G_{22}$ into the identity
matrix, and does not require expensive computations since $K$
is already available from the previous iteration.

Then, with a series of Householder reflections [25], we
compute the QR factorization of the input-to-state matrix
in (19)

$$\left(Q^{(2)}\right)^T K^{-1} C_{21}^{(1)} = \begin{bmatrix} R^{(2)} \\ 0 \end{bmatrix} \tag{20}$$

where $R^{(2)} \in \mathbb{R}^{p \times p}$ is upper triangular and $Q^{(2)} \in \mathbb{R}^{(m-p) \times (m-p)}$
is an orthogonal matrix given by the product
of Householder reflectors [25].
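After $Q^{(2)}$ is applied to (19) with a congruence
transformation, the system will read

$$\left( \begin{bmatrix} I_p & 0 \\ 0 & I_{m-2p} \end{bmatrix} + s \begin{bmatrix} C_{11}^{(2)} & * \\ C_{21}^{(2)} & C_{22}^{(2)} \end{bmatrix} \right) \begin{bmatrix} x_1^{(2)} \\ x_2^{(2)} \end{bmatrix} = \begin{bmatrix} -R^{(2)} \\ 0 \end{bmatrix} u_2^{(1)} \tag{21a}$$

$$y_2^{(1)} = \begin{bmatrix} -R^{(2)} \\ 0 \end{bmatrix}^T \begin{bmatrix} x_1^{(2)} \\ x_2^{(2)} \end{bmatrix} \tag{21b}$$

where

$$C_{22}^{(2)} = \begin{bmatrix} 0 & I_{m-2p} \end{bmatrix} \left(Q^{(2)}\right)^T K^{-1} C_{22} K^{-T} Q^{(2)} \begin{bmatrix} 0 \\ I_{m-2p} \end{bmatrix} \tag{22}$$

$$C_{21}^{(2)} = \begin{bmatrix} 0 & I_{m-2p} \end{bmatrix} \left(Q^{(2)}\right)^T K^{-1} C_{22} K^{-T} Q^{(2)} \begin{bmatrix} I_p \\ 0 \end{bmatrix} \tag{23}$$

$$C_{11}^{(2)} = \begin{bmatrix} I_p & 0 \end{bmatrix} \left(Q^{(2)}\right)^T K^{-1} C_{22} K^{-T} Q^{(2)} \begin{bmatrix} I_p \\ 0 \end{bmatrix} \tag{24}$$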

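System (21a)-(21b) is now in the form (12a)-(12b), and the
reduction process used in iteration 1 can be applied again.
System (21a)-(21b) can be seen as the cascade of a first system
of order $p$

$$\Sigma_1^{(2)}: \begin{cases} x_1^{(2)} + s C_{11}^{(2)} x_1^{(2)} = -R^{(2)} u_2^{(1)} + u_1^{(2)} \\ y_2^{(1)} = -\left(R^{(2)}\right)^T x_1^{(2)} \end{cases} \tag{25}$$

and a second system of order $m - 2p$

$$\Sigma_2^{(2)}: \begin{cases} I_{m-2p}\, x_2^{(2)} + s C_{22}^{(2)} x_2^{(2)} = -C_{21}^{(2)} u_2^{(2)} \\ y_2^{(2)} = -\left(C_{21}^{(2)}\right)^T x_2^{(2)} \end{cases} \tag{26}$$

The two systems are only dynamically coupled, through
equations $u_1^{(2)} = s y_2^{(2)}$ and $u_2^{(2)} = s x_1^{(2)}$. Overall, the original
system (1) is now decomposed into three blocks, all coupled
dynamically, as shown in Fig. 2. If we retain the first two
blocks, and neglect $\Sigma_2^{(2)}$, we obtain a ROM of order $2p$

$$\left( \begin{bmatrix} G_{11}^{(1)} & 0 \\ 0 & I_p \end{bmatrix} + s \begin{bmatrix} C_{11}^{(1)} & * \\ R^{(2)} & C_{11}^{(2)} \end{bmatrix} \right) \begin{bmatrix} x_1 \\ x_1^{(2)} \end{bmatrix} = \begin{bmatrix} B_1 \\ 0 \end{bmatrix} u \tag{27a}$$

$$y = \begin{bmatrix} B_1^T & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_1^{(2)} \end{bmatrix} \tag{27b}$$

As shown in the Appendix, this model matches the first 4
moments of the original system.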

3) Matching More Than Four Moments: Additional
moments can be matched by iterating the proposed process, and
further decomposing subsystem $\Sigma_2^{(2)}$ in Fig. 2. This goal can
be achieved by computing, at each iteration $j \geq 3$, the QR
decomposition

$$\left(Q^{(j)}\right)^T C_{21}^{(j-1)} = \begin{bmatrix} R^{(j)} \\ 0 \end{bmatrix} \tag{29}$$

of the input-to-state matrix of the innermost system (at
iteration $j = 3$, matrix $C_{21}^{(2)}$ in (26)). The QR decomposition
is obtained with a series of Householder reflectors that form
the congruence matrix $Q^{(j)}$. The obtained system will have
the same structure as (21a)-(21b), and can be seen as the
cascade of two blocks. The first system $\Sigma_1^{(j)}$, of size $p$, will
add two matched moments to the ROM computed up to that
point. The second system will be further decomposed if $j < q$.
Otherwise, at the last iteration, it will be discarded. After $q$
iterations, the obtained ROM will have order $pq$, and will be
in the form shown in equation (28). In the Appendix, we prove
that the obtained model
matches $2q$ moments of the original network. The proposed
technique therefore leads to a ROM of the same size and
accuracy as PRIMA, but in a more efficient way, which
avoids the explicit construction of a huge and dense projection
matrix. In comparison to SIP [20], which can match only two
moments per frequency point, the proposed method can match
an arbitrary number of moments, and does not suffer from the
singularity issues of multipoint SIP [20]. The use of PRIMA
to match additional moments, advocated in SparseRC [23], is
also avoided.

Fig. 2. Structure of the system obtained after two iterations of the proposed
method.
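The complete recursion can be summarized with a dense sketch, assuming NumPy, ports on the first $p$ nodes and $B_1 = I_p$. The function name `turbomor_dense` and the random test system are hypothetical; the sketch forms each orthogonal factor explicitly for readability, which a memory-conscious implementation would avoid. Coupling blocks are assembled directly from the congruence, so their sign follows from (20) and (29).

```python
import numpy as np

def moments(G, C, B, count):
    """M_k = B^T (-G^{-1} C)^k G^{-1} B for k = 0..count-1, as in (5)."""
    V = np.linalg.solve(G, B)
    out = []
    for _ in range(count):
        out.append(B.T @ V)
        V = -np.linalg.solve(G, C @ V)
    return out

def turbomor_dense(G, C, p, q):
    """Dense sketch of q iterations; returns the ROM matrices of order p*q."""
    G11, G21, G22 = G[:p, :p], G[p:, :p], G[p:, p:]
    C11, C21, C22 = C[:p, :p], C[p:, :p], C[p:, p:]
    # Iteration 1: block Gaussian elimination, equations (13)-(15).
    K = np.linalg.cholesky(G22)                          # G22 = K K^T
    X = np.linalg.solve(K.T, np.linalg.solve(K, G21))    # G22^{-1} G21
    n = p * q
    Gr, Cr = np.eye(n), np.zeros((n, n))
    Gr[:p, :p] = G11 - G21.T @ X
    Cr[:p, :p] = C11 - X.T @ C21 - C21.T @ X + X.T @ C22 @ X
    Br = np.vstack([np.eye(p), np.zeros((n - p, p))])
    # Scaling step (19): G22 becomes the identity.
    C21j = np.linalg.solve(K, C21 - C22 @ X)               # K^{-1} C21^(1)
    C22j = np.linalg.solve(K, np.linalg.solve(K, C22).T)   # K^{-1} C22 K^{-T}
    for j in range(1, q):
        Q, R = np.linalg.qr(C21j, mode="complete")       # Householder QR, (20)/(29)
        T = Q.T @ C22j @ Q
        a, b = j * p, (j + 1) * p
        Cr[a:b, a:b] = T[:p, :p]                         # diagonal block C11^(j+1)
        Cr[a:b, a - p:a] = R[:p]                         # coupling block R^(j+1)
        Cr[a - p:a, a:b] = R[:p].T
        C21j, C22j = T[p:, :p], T[p:, p:]                # innermost subsystem
    return Gr, Cr, Br

# Hypothetical test system: random SPD matrices, ports on the first p nodes.
rng = np.random.default_rng(3)
m, p, q = 24, 2, 3
X0 = rng.standard_normal((m, m))
Y0 = rng.standard_normal((m, m))
G = X0 @ X0.T + m * np.eye(m)
C = Y0 @ Y0.T + m * np.eye(m)
B = np.vstack([np.eye(p), np.zeros((m - p, p))])

Gr, Cr, Br = turbomor_dense(G, C, p, q)
full = moments(G, C, B, 2 * q)
red = moments(Gr, Cr, Br, 2 * q)
```

In this sketch the reduced conductance matrix is block diagonal and the reduced capacitance matrix is block tridiagonal, so blocks that are two or more positions apart remain zero.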

Another key advantage of the proposed method is the
block-diagonal structure of (28). Unlike PRIMA, which
generates dense models, the proposed method naturally leads to
a sparse representation. This reduces the memory footprint
of the ROMs, and accelerates subsequent simulations, as
we shall see in Sec. IV-C. Although PRIMA models can be
sparsified with an eigenvalue decomposition, this operation
costs extra CPU cycles. The obtained models are stable and
passive by construction, since only congruence transformations
like (8) have been used to generate the ROM matrices. The
positive-definite nature of $G$ and $C$ in (1) is thus preserved,
which implies passivity and guarantees stable transient
simulations [9]. We also note that TurboMOR preserves the matrix
$B_1$ in (10a) and (10b) that maps input ports to state equations.
As discussed in [26], this property facilitates the connection of
the ROM to the surrounding components. Finally, the obtained
ROM can be converted into an RC equivalent circuit using the
procedure in [26], for seamless integration into existing tools
for electronic design automation.


$$\left( \begin{bmatrix} G_{11}^{(1)} & & & \\ & I_p & & \\ & & \ddots & \\ & & & I_p \end{bmatrix} + s \begin{bmatrix} C_{11}^{(1)} & * & & \\ R^{(2)} & C_{11}^{(2)} & \ddots & \\ & \ddots & \ddots & * \\ & & R^{(q)} & C_{11}^{(q)} \end{bmatrix} \right) \begin{bmatrix} x_1 \\ x_1^{(2)} \\ \vdots \\ x_1^{(q)} \end{bmatrix} = \begin{bmatrix} B_1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} u, \qquad y = \begin{bmatrix} B_1^T & 0 & \cdots & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_1^{(2)} \\ \vdots \\ x_1^{(q)} \end{bmatrix} \tag{28}$$

B. Practical Implementation

We now discuss how TurboMOR can be implemented
for maximum efficiency in terms of CPU time and memory
consumption. The Cholesky decomposition of $G_{22}$ can
be obtained using efficient routines for the factorization of
sparse, positive-definite matrices, such as the supernodal
method [28] available in MATLAB's chol routine. The QR
decomposition in (20) is computed with the Householder
method. In our MATLAB implementation of TurboMOR, we
used a direct call to the compiled LAPACK routine
DGEQRF [29], which returns the orthogonal matrix $Q^{(j)}$ in
factored form [25]. Such a matrix is never computed explicitly, but
kept in factored form. The LAPACK routine DORMQR [29]
can be used to compute products involving $Q^{(j)}$ directly from
its factorization. Being large and dense, matrix $C_{22}^{(j)}$ is also
never computed explicitly. Its factored form is always used,
which is given by (22) for $j = 2$ and by

$$C_{22}^{(j)} = \begin{bmatrix} 0 & I_{m-jp} \end{bmatrix} \left(Q^{(j)}\right)^T C_{22}^{(j-1)} Q^{(j)} \begin{bmatrix} 0 \\ I_{m-jp} \end{bmatrix} \tag{30}$$

for $j > 2$.

C. On the Singularity of G22

Throughout the derivation of TurboMOR, we assumed the
block $G_{22}$ in (10a) to be strictly positive definite, hence
invertible. When this is not the case, we adopt the solution
proposed in [23] for SparseRC. The rows and columns that
make $G_{22}$ singular are promoted into the first set of equations,
and not eliminated. Since the number of such rows is typically
very low, this does not significantly increase the size of the
obtained ROMs.

D. TurboMOR with partitioning

Graph partitioning techniques can be integrated into
TurboMOR to reduce very large networks, such as the power
grid models that we will consider in Sec. IV. A possible
partitioning strategy, used in | | 2 ^ and p^ , is to partition
the given network into subnetworks that interact only through
a limited set of nodes, called separator nodes. An optimal
partitioning can be found with the nested dissection algorithm
nesdis from the SuiteSparse package [28]. Once the network
nodes are reordered according to the partitions identified by
nesdis, the matrices in (1) assume a bordered block diagonal
form pO) . To illustrate this, consider a three-component
partitioning of (1)

$$\left( \begin{bmatrix} G_1 & 0 & * \\ 0 & G_2 & * \\ G_{31} & G_{32} & G_3 \end{bmatrix} + s \begin{bmatrix} C_1 & 0 & * \\ 0 & C_2 & * \\ C_{31} & C_{32} & C_3 \end{bmatrix} \right) \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} B_1 \\ B_2 \\ B_3 \end{bmatrix} u \tag{31}$$

Blocks $G_1, C_1$ and $G_2, C_2$ correspond to two decoupled
subsystems, which interact only through a set of separator nodes
associated with $G_3$, via coupling matrices $G_{31}$, $C_{31}$, $G_{32}$,
$C_{32}$. Subsystems 1 and 2 can be reduced individually. The
coupling matrices are then updated accordingly. For instance,
for reducing subsystem 1, we first form its nodal equations

$$\left( \begin{bmatrix} G_1 & * \\ G_{31} & G_3 \end{bmatrix} + s \begin{bmatrix} C_1 & * \\ C_{31} & C_3 \end{bmatrix} \right) \begin{bmatrix} x_1 \\ x_3 \end{bmatrix} = \begin{bmatrix} B_1 \\ B_3 \end{bmatrix} u \tag{32}$$

and then reorder its nodes such that

• port nodes and separator nodes come first, and form the
state vector $x_1$ in (10a);

• internal nodes come second, forming $x_2$ in (10a).

Then, we perform the reduction as in Sec. III-A. After all
subsystems have been reduced, the obtained ROM will read

$$\left( \begin{bmatrix} \tilde{G}_1 & 0 & * \\ 0 & \tilde{G}_2 & * \\ \tilde{G}_{31} & \tilde{G}_{32} & \tilde{G}_3 \end{bmatrix} + s \begin{bmatrix} \tilde{C}_1 & 0 & * \\ 0 & \tilde{C}_2 & * \\ \tilde{C}_{31} & \tilde{C}_{32} & \tilde{C}_3 \end{bmatrix} \right) \begin{bmatrix} \tilde{x}_1 \\ \tilde{x}_2 \\ \tilde{x}_3 \end{bmatrix} = \begin{bmatrix} \tilde{B}_1 \\ \tilde{B}_2 \\ \tilde{B}_3 \end{bmatrix} u \tag{33}$$

As numerical results will show, partitioning reduces the overall
cost of the reduction, since TurboMOR is applied to subsystems
of smaller size. Additionally, it reduces the number of fill-ins
in the ROM, since the zero blocks in (31) are maintained
in (33).
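The bordered block diagonal form can be illustrated on a toy network. The sketch below assumes NumPy and a hypothetical seven-node resistive chain in which the middle node is chosen by hand as the separator; in practice the partitions would come from a nested dissection routine, which is not reproduced here.

```python
import numpy as np

def chain_conductance(m, g=1.0):
    """Hypothetical network: m nodes in a chain, last node tied to ground."""
    G = np.zeros((m, m))
    for i in range(m - 1):
        G[i, i] += g
        G[i + 1, i + 1] += g
        G[i, i + 1] -= g
        G[i + 1, i] -= g
    G[-1, -1] += g
    return G

def bordered_order(part1, part2, separator):
    """Node ordering: subsystem 1, subsystem 2, then the separator nodes."""
    return np.array(list(part1) + list(part2) + list(separator))

G = chain_conductance(7)
perm = bordered_order([0, 1, 2], [4, 5, 6], [3])   # node 3 separates the halves
Gp = G[np.ix_(perm, perm)]                          # reordered matrix, form (31)

G12 = Gp[:3, 3:6]    # coupling between subsystems 1 and 2 (expected to vanish)
G31 = Gp[6:, :3]     # coupling between separator and subsystem 1
G32 = Gp[6:, 3:6]    # coupling between separator and subsystem 2
```

After the permutation the two subsystems interact only through the last row and column of `Gp`, which is the structure exploited when each subsystem is reduced separately.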

IV. Numerical Results 

The proposed TurboMOR algorithm has been implemented
in MATLAB, with direct calls to compiled LAPACK libraries
for a few key operations, namely the QR decompositions
(20) and (29) and the computation of the products with the
Householder matrices. In this section, we compare the
performance of TurboMOR against PRIMA Q and SparseRC [23].
Computations were performed on a 3.40 GHz Intel i7 CPU,
with 16 GB of memory and MATLAB R2013b.

A. Reduction Time 

Table I shows the time needed by the different methods to
reduce various test networks. Example 1 is an on-chip bus
consisting of 128 signal lines. The bus was modelled with
lumped RC segments, and has the characteristics of a global
interconnect in the 65 nm technology node [31]. Examples 2-6
are power grid benchmarks obtained from [32]. The original
benchmarks include some inductors, which were neglected. A
variable number of input current sources has been considered
to investigate the scalability of the MOR methods with respect
to port count.

TABLE I

Reduction time for the different methods on various test networks. All times are in seconds. Speedups are w.r.t. PRIMA.

Example            q  PRIMA      SparseRC            TurboMOR            TurboMOR with partitioning
                      CPU time   CPU time  Speedup   CPU time  Speedup   CPU time  Speedup
1. On-chip bus     1      0.94       0.40    2.35X       0.28    3.36X       0.37    2.54X
p = 256            2      2.84       1.83    1.55X       1.48    1.92X       1.58    1.80X
m = 38,528         3      4.78       3.63    1.32X       3.00    1.59X       2.94    1.63X
2. ibmpg1t (RC)    1      0.41       0.16    2.56X       0.19    2.16X       0.18    2.28X
p = 200            2      1.10       0.51    2.16X       0.66    1.67X       0.45    2.44X
m = 25,195         3      1.98       0.94    2.11X       1.38    1.43X       0.85    2.33X
3. ibmpg2t (RC)    1     22.00       6.24    3.53X      10.84    2.03X       6.28    3.50X
p = 800            2     65.55      20.89    3.14X      37.82    1.73X      18.89    3.47X
m = 163,697        3    118.64      39.48    3.01X      78.38    1.51X      35.28    3.36X
4. ibmpg2t (RC)    1     35.51       9.53    3.73X      16.64    2.13X       9.52    3.73X
p = 1200           2    109.41      32.63    3.35X      60.60    1.81X      29.19    3.75X
m = 163,697        3    224.06      64.41    3.48X     132.55    1.69X      56.58    3.96X
5. ibmpg2t (RC)    1     49.50      11.66    4.25X      21.29    2.33X      11.76    4.21X
p = 1500           2    152.81      43.06    3.55X      83.66    1.83X      37.58    4.07X
m = 163,697        3    729.92      83.76    8.71X     186.95    3.90X      72.32   10.09X
6. ibmpg2t (RC)    1     73.17      16.29    4.49X      31.23    2.34X      16.29    4.49X
p = 2000           2    340.18      62.12    5.48X     228.32    1.49X      54.76    6.21X
m = 163,697        3   9807.11     122.12   80.31X    1051.72    9.32X     115.71   84.76X

We first compare the proposed method without partitioning
against PRIMA, in order to assess its intrinsic efficiency in
matching moments. For each test case, reduced order models
have been generated to match 2, 4, and 6 moments. From the
results in Table I, we observe that TurboMOR is consistently
faster than PRIMA, up to 9.32 times. Savings are particularly
high when order and port count are high, as in example 6.
While PRIMA takes 2 hours and 43 minutes (9807 s) to
match 6 moments, TurboMOR achieves the same result in only
17.5 minutes (1051 s). This speed-up is due to the fact that
TurboMOR achieves moment matching without computing
and storing a large projection matrix as PRIMA does.
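The speedup figures quoted for example 6 follow directly from the CPU times in Table I. As a small worked check, using only the tabulated times for q = 3 (the helper name `speedup` is illustrative):

```python
def speedup(reference_time, method_time):
    """Speedup of a method with respect to a reference, as a time ratio."""
    return reference_time / method_time

# CPU times in seconds for example 6 (p = 2000), q = 3, taken from Table I.
prima = 9807.11
sparserc = 122.12
turbomor = 1051.72
turbomor_partitioned = 115.71

s_turbomor = round(speedup(prima, turbomor), 2)
s_sparserc = round(speedup(prima, sparserc), 2)
s_partitioned = round(speedup(prima, turbomor_partitioned), 2)

prima_hours, rem = divmod(prima, 3600)     # whole hours and remaining seconds
prima_minutes = int(rem // 60)
turbomor_minutes = round(turbomor / 60, 1)
```

The ratios reproduce the 9.32X, 80.31X and 84.76X entries of the table, and the conversions give the 2 hours 43 minutes and 17.5 minutes mentioned in the text.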

Then, we compare TurboMOR with partitioning against the
recently-proposed SparseRC method [23]. From Table I, we
observe that partitioning improves reduction time substantially,
especially for large networks (examples 3, 4, 5 and 6).
Comparing the proposed method and SparseRC, we see that for two
moments matched (q = 1), both methods have almost the same
reduction time. This is expected since, in this case, the methods
perform the same operations. However, when additional
moments are matched (q = 2 and q = 3), the proposed method
is always faster than SparseRC, which employs PRIMA to
match additional moments, losing some efficiency. This result
shows how, with the Householder transformations proposed in
Sec. III-A, additional moments can be efficiently matched.


B. Accuracy of the Reduced Models 

In this section we demonstrate that, from an accuracy
standpoint, TurboMOR is equivalent to PRIMA. For this
purpose, we consider the power grid "ibmpg1t" from [32], which
corresponds to example 2 in Table I. A transient simulation
is performed to calculate the voltage at one of the supply
ports of the power grid, when switching currents are drawn
by the different blocks of the integrated circuit. Fig. 3 shows
the time response obtained with the original system and the
reduced models from TurboMOR and PRIMA, for the case of
two moments matched (q = 1). Both methods provide similar
results. This confirms that the proposed method is as accurate
as PRIMA, but more efficient.

Fig. 3. Transient response of the original system and the reduced models
obtained with TurboMOR and PRIMA. The reduced models match two
moments (q = 1).

Fig. 4. Error between the response of the original system and the response
of the reduced models computed with PRIMA and the proposed method. The
reduced models match two moments (q = 1).

OYARO AND TRIVERIO - TURBOMOR

Fig. 5. As in Fig. 3, but for four moments matched (q = 2).

Fig. 6. As in Fig. 4, but for four moments matched (q = 2).

In Fig. 4, the maximum error for the two ROMs is depicted.
Figs. 3 and 4 show that a ROM with only two moments matched is
not suitable for an accurate assessment of the voltage drop
across the power grid. Indeed, the ROMs underestimate the
voltage drop, by as much as 5 mV. In Fig. 5, we show the
transient results obtained with PRIMA and TurboMOR models
that match four moments (q = 2). Now, both models lead to
a very accurate prediction of the original system response.
The worst case transient error is indeed below 1 mV, as
shown by Fig. 6. This example shows that matching only
two moments as in SIP [20] is not accurate enough for
some applications. TurboMOR can instead match an arbitrary
number of moments, and meet any accuracy requirement set
by the user.

C. Efficiency of the Reduced Models 

We now evaluate the efficiency of the ROMs generated by
the proposed method, PRIMA, and SparseRC. In Table II, the
simulation time for the original network and the various ROMs
is reported.

Without partitioning, TurboMOR produces ROMs that are
consistently faster than PRIMA models. This is attributed to
the block diagonal structure of the reduced models, which
reduces the cost of the LU factorizations used to perform
subsequent transient simulations. TurboMOR models are faster
by up to five times.

Fig. 7. Reduction time for PRIMA and TurboMOR without partitioning vs
number of ports. Both methods match six moments (q = 3).

Fig. 8. Reduction time for SparseRC and TurboMOR with partitioning vs
number of ports. Both methods match six moments (q = 3).

Comparing now the simulation times for the methods with
partitioning (SparseRC and TurboMOR with partitioning), we
observe that when two moments are matched (q = 1), the
simulation times are essentially the same, which is expected since
both methods adopt the same reduction strategy. However,
when additional moments are matched, TurboMOR delivers
models that are always faster than those from SparseRC,
because of higher sparsity. SparseRC uses PRIMA to match
additional moments, which introduces some large and dense
blocks in the ROM.

D. Scalability 

Finally, we investigate the scalability of TurboMOR and 
existing methods with respect to network order and number 
of ports. Tests are performed on the first example (on-chip 
bus) for the case of six moments matched. 

1) Varying Number of Ports, Constant Node-to-Port Ratio: 
In the first test, we vary the number of signal lines and, 
consequently, ports. Since bus length is kept constant, the 
network order increases linearly with the number of ports. 
The node-to-port ratio remains constant at 150.5. 

Fig. 7 depicts the reduction time for TurboMOR (without
partitioning) and PRIMA versus the number of ports. We
observe that TurboMOR scales better than PRIMA, and time
savings grow as port count increases. In Fig. 8, the analysis
is repeated for the proposed method with partitioning and
SparseRC. Also in this case, TurboMOR scales better than
existing methods.

TABLE II

Simulation time for the ROMs obtained with the different methods. All times in seconds.

Example            Original  q  PRIMA               SparseRC            TurboMOR            TurboMOR with partitioning
                   Sim.time     Sim.time  Speedup   Sim.time  Speedup   Sim.time  Speedup   Sim.time  Speedup
1. On-chip bus         3.00  1      0.07   42.86X       0.06   50.00X       0.07   42.86X       0.06   50.00X
p = 256                      2      0.76    3.95X       0.80    3.75X       0.24   12.50X       0.60    5.00X
m = 38,528                   3      3.23    0.93X       2.38    1.26X       0.54    5.56X       1.48    2.03X
2. ibmpg1t (RC)        2.26  1      0.23    9.83X       0.14   16.14X       0.13   17.38X       0.14   16.14X
p = 200                      2      0.76    2.97X       0.32    7.06X       0.33    6.85X       0.27    8.37X
m = 25,195                   3      1.92    1.18X       1.19    1.90X       0.61    3.70X       0.58    3.90X
3. ibmpg2t (RC)       29.84  1      4.32    6.91X       1.58   18.89X       1.60   18.65X       1.61   18.53X
p = 800                      2     12.98    2.30X       5.94    5.02X       8.11    3.68X       4.78    6.24X
m = 163,697                  3     27.28    1.09X      12.68    2.35X      13.63    2.19X       7.55    3.95X
4. ibmpg2t (RC)       30.24  1      8.58    3.52X       3.56    8.49X       3.67    8.24X       3.56    8.49X
p = 1200                     2     28.75    1.05X      12.98    2.33X      20.30    1.49X      10.10    2.99X
m = 163,697                  3     62.51    0.48X      28.40    1.06X      34.98    0.86X      16.21    1.87X
5. ibmpg2t (RC)       30.69  1     13.31    2.31X       5.55    5.53X       5.98    5.13X       5.60    5.48X
p = 1500                     2     45.32    0.68X      20.27    1.51X      35.23    0.87X      15.55    1.97X
m = 163,697                  3    104.12    0.29X      42.93    0.71X      60.40    0.51X      24.21    1.27X
6. ibmpg2t (RC)       30.80  1     23.08    1.33X       9.45    3.26X      10.89    2.83X       9.58    3.22X
p = 2000                     2     81.00    0.38X      34.84    0.88X      73.17    0.42X      26.80    1.15X
m = 163,697                  3    173.67    0.18X      77.28    0.40X     121.54    0.25X      43.18    0.71X

Fig. 9. Reduction time for PRIMA and proposed method without partitioning,
as a function of the ratio of network order and number of ports.

Fig. 10. As in Fig. 9, but for SparseRC and proposed method with
partitioning.


2) Varying Node-to-Port Ratio, Constant Number of Ports:
In the second test, we keep the number of ports constant at
1024, which corresponds to 512 lines. We increase the number
of nodes and, consequently, the order by making the bus longer.

Fig. 9 shows the reduction time for the two methods
without partitioning (proposed and PRIMA) as a function of
the number of nodes. Beyond a certain point, the reduction
time for PRIMA increases dramatically, because the projection
matrix becomes larger than the 16 GB of memory available on
the machine. PRIMA then resorts to slow swap memory
and becomes very inefficient. With TurboMOR, large projection
matrices are avoided. The matrices used to perform
the congruence transformations are either sparse (Cholesky
factor K) or stored in efficient factored form (Householder
reflectors). This results in lower memory consumption,
and allows TurboMOR to achieve high scalability even for
very large port counts. In Fig. 10, the analysis is repeated
for TurboMOR with partitioning and SparseRC. The figure
confirms the efficiency of the proposed models, which are
faster than those generated by SparseRC, especially for large
systems with many ports.
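
To illustrate why a factored form saves memory, the following Python sketch applies a generic Householder reflector $I - 2vv^T/(v^Tv)$ to a matrix using only the vector $v$, and compares the storage of $v$ with that of the explicit matrix. This is a textbook reflector, not necessarily the specific reflectors constructed by TurboMOR. The helper names, the random test data, the network size of 350,000 nodes, and the assumption that a dense projection matrix has one block of $p$ columns per matched moment are all illustrative assumptions.

```python
import numpy as np


def apply_reflector(v, X):
    """Return (I - 2 v v^T / (v^T v)) @ X without forming the n-by-n matrix."""
    return X - np.outer(v, (2.0 / (v @ v)) * (v @ X))


def dense_matrix_gib(rows, cols, bytes_per_entry=8):
    """Memory in GiB of a dense rows-by-cols matrix of double-precision entries."""
    return rows * cols * bytes_per_entry / 2**30


rng = np.random.default_rng(0)
n = 500
v = rng.standard_normal(n)
X = rng.standard_normal((n, 4))

# Explicit reflector, built here only to check the factored application.
H = np.eye(n) - 2.0 * np.outer(v, v) / (v @ v)
assert np.allclose(apply_reflector(v, X), H @ X)
assert np.allclose(H @ H, np.eye(n))  # a reflector is its own inverse

# Storage: one vector of length n versus an n-by-n dense matrix.
print("vector v (GiB):      ", dense_matrix_gib(n, 1))
print("explicit H (GiB):    ", dense_matrix_gib(n, n))

# Hypothetical dense projection matrix: 350,000 rows, 6 blocks of p = 1024 columns.
print("dense projector (GiB):", dense_matrix_gib(350_000, 6 * 1024))
```

Under these assumptions the dense projection matrix alone is on the order of the 16 GB mentioned above, while each reflector costs only one vector of storage.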

V. Conclusion 

We introduced TurboMOR, a new model order reduction
method for large RC networks with many ports. TurboMOR
achieves moment matching via efficient Householder transformations,
sparse matrix factorizations, and graph partitioning
techniques. In contrast to popular methods such as PRIMA,
no large and dense projection matrices need to be computed
or stored. This feature makes TurboMOR more efficient than
existing methods in terms of both CPU time and memory
consumption. A key novelty of the proposed method is the
sparse and block-diagonal structure of the generated models,
which makes them faster at run-time. Based on this structure,
we provide an insightful interpretation of moment matching in
terms of system theory. TurboMOR models are passive by
construction, and can be cast into an equivalent RC circuit, for
seamless integration into electronic design automation tools.
Numerical results demonstrate the superior performance of
TurboMOR in reducing large passive networks with many
ports, which arise more and more frequently in practice.


Appendix A 

Proof of moment matching 


We prove that the reduced model (28), obtained after $q$
iterations of the proposed method, matches $2q$ moments. We
assume that $G$ is invertible, since otherwise the moments are not
defined. If $G$ is singular, the proposed method will still work,
but one cannot speak of moment matching.

The starting point of the proof is realization (12a)-(12b),
which is obtained from the original system (10a)-(10b) by
means of a congruence transformation. Since the transformation
matrix is invertible by construction, the transformation changes
neither the transfer function nor the system moments.
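
The invariance of the moments under a congruence transformation can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses small random symmetric positive definite matrices in place of a real RC network, a unit lower-triangular matrix `V` as a stand-in for the invertible transformation matrix, and a hypothetical helper `moments` that expands $H(s) = B^T (G + sC)^{-1} B$ about $s = 0$.

```python
import numpy as np


def moments(G, C, B, count):
    """First `count` moments of H(s) = B^T (G + s C)^{-1} B about s = 0."""
    X = np.linalg.solve(G, B)           # G^{-1} B
    result = []
    for _ in range(count):
        result.append(B.T @ X)          # k-th moment: B^T (-G^{-1} C)^k G^{-1} B
        X = -np.linalg.solve(G, C @ X)  # advance to the next power of -G^{-1} C
    return result


rng = np.random.default_rng(1)
n, p = 8, 2

A = rng.standard_normal((n, n))
G = A @ A.T + n * np.eye(n)             # symmetric positive definite, hence invertible
A = rng.standard_normal((n, n))
C = A @ A.T + np.eye(n)                 # symmetric positive definite
B = np.vstack([np.eye(p), np.zeros((n - p, p))])

# Unit lower-triangular matrix: determinant 1, so invertible by construction.
V = np.eye(n) + 0.5 * np.tril(rng.standard_normal((n, n)), -1)

# Congruence transformation of the realization.
G_t, C_t, B_t = V.T @ G @ V, V.T @ C @ V, V.T @ B

M_orig = moments(G, C, B, 4)
M_cong = moments(G_t, C_t, B_t, 4)
assert all(np.allclose(a, b) for a, b in zip(M_orig, M_cong))
print("first four moments are unchanged by the congruence transformation")
```

The check works because $B^T V \,[V^T (G + sC) V]^{-1}\, V^T B = B^T (G + sC)^{-1} B$ for any invertible $V$.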

The key argument of the proposed proof is the derivation
of the relation between the moments $M_k$ of the original
system (12a)-(12b) and the moments of the inner subsystem
extracted by TurboMOR after one iteration. The
transfer function of the original system (12a)-(12b) can be
written as

$$H(s) = B_1^T \left[ G_{11}^{(1)} + s C_{11}^{(1)} - s^2 H_1(s) \right]^{-1} B_1 \tag{34}$$


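where

$$H_1(s) = \left( C_{21}^{(1)} \right)^T \left( G_{22}^{(1)} + s C_{22}^{(1)} \right)^{-1} C_{21}^{(1)} \tag{35}$$

is the transfer function of the inner subsystem. The
moments of this subsystem are denoted with $N_l$, so we have

$$H_1(s) = \sum_{l=0}^{+\infty} N_l s^l \tag{36}$$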


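After substituting the moment expansion of $H(s)$ and (36) into (34), we obtain

$$\sum_{k=0}^{+\infty} M_k s^k = B_1^T \left[ G_{11}^{(1)} + s C_{11}^{(1)} - \sum_{l=0}^{+\infty} N_l s^{l+2} \right]^{-1} B_1 \tag{37}$$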


For circuits, matrix $B_1$ is typically a permutation of the identity
matrix, and is thus invertible.* We can thus rewrite (37) as

$$\left[ G_{11}^{(1)} + s C_{11}^{(1)} - \sum_{l=0}^{+\infty} N_l s^{l+2} \right] B_1^{-T} \sum_{k=0}^{+\infty} M_k s^k = B_1 \tag{38}$$




*If $B_1$ is not full rank, a correlation between some inputs exists, which can
be extracted before the reduction, making the ROM smaller and leading
to a full-rank $B_1$.


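where superscript $-T$ denotes the inverse of the transpose.
After exchanging the two series, we have

$$\sum_{k=0}^{+\infty} \left[ G_{11}^{(1)} B_1^{-T} M_k s^k + C_{11}^{(1)} B_1^{-T} M_k s^{k+1} - \sum_{l=0}^{+\infty} N_l B_1^{-T} M_k s^{k+l+2} \right] = B_1 \tag{39}$$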


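Both sides of (39) are polynomials in $s$ that, in order to be
equal, must have the same coefficients. Imposing the equality
between the coefficients of $s^0$, we obtain

$$G_{11}^{(1)} B_1^{-T} M_0 = B_1 \quad \Longrightarrow \quad M_0 = B_1^T \left( G_{11}^{(1)} \right)^{-1} B_1 \tag{40}$$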


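The inverse of $G_{11}^{(1)}$ exists since we assumed that $G$ is non-singular. By
equating the coefficients of $s^1$, we have

$$M_1 = -B_1^T \left( G_{11}^{(1)} \right)^{-1} C_{11}^{(1)} \left( G_{11}^{(1)} \right)^{-1} B_1 \tag{41}$$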


Equations (40) and (41) show that the first two moments of
the original system depend only on the matrices $B_1$, $G_{11}^{(1)}$ and
$C_{11}^{(1)}$. Such matrices are preserved in the reduced model (18),
which thus matches the first two moments of the original
system. By equating the coefficients of a generic power $s^r$
in (39) for $r \geq 2$, we obtain the recursive relation

$$M_r = -B_1^T \left( G_{11}^{(1)} \right)^{-1} C_{11}^{(1)} B_1^{-T} M_{r-1} + B_1^T \left( G_{11}^{(1)} \right)^{-1} \sum_{l=0}^{r-2} N_l B_1^{-T} M_{r-l-2} \tag{42}$$

Equation (42) shows that the moment $M_r$ of order $r$ of the
original system (12a)-(12b) depends on:

1) the matrices $B_1$, $G_{11}^{(1)}$ and $C_{11}^{(1)}$ of the outer subsystem,
which are always preserved in the reduced model;

2) the moments $N_l$ of the inner subsystem up to order
$r-2$.

Therefore, if one replaces the nested subsystem with a
reduced model that preserves its first $r - 2$ moments, then the
overall model will match $r$ moments of the original system.
By iterating this argument, it is straightforward to prove that
ROM (28) matches $2q$ moments of the original system.
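
The recursion (42), together with (40) and (41), can be verified numerically on a small example. The sketch below is a sanity check under stated assumptions, not the authors' implementation: it builds a random system with block-diagonal $G$, symmetric $C$ and $B = [B_1; 0]$ (the structure assumed for (12a)-(12b)), computes the moments $M_k$ directly, and compares them with the values generated from the inner-subsystem moments $N_l$. All names, sizes and the random data are hypothetical.

```python
import numpy as np


def moments(G, C, B_in, B_out, count):
    """Moments of B_out^T (G + s C)^{-1} B_in about s = 0."""
    X = np.linalg.solve(G, B_in)
    out = []
    for _ in range(count):
        out.append(B_out.T @ X)
        X = -np.linalg.solve(G, C @ X)
    return out


def spd(rng, n):
    """Random symmetric positive definite matrix (illustrative data only)."""
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)


rng = np.random.default_rng(0)
p, n2, R = 3, 5, 6                      # ports, inner-subsystem size, moments checked
C = spd(rng, p + n2)
G11, G22 = spd(rng, p), spd(rng, n2)
G = np.block([[G11, np.zeros((p, n2))], [np.zeros((n2, p)), G22]])
B1 = np.eye(p)[:, ::-1]                 # a permutation of the identity matrix
B = np.vstack([B1, np.zeros((n2, p))])
C11, C21, C22 = C[:p, :p], C[p:, :p], C[p:, p:]

M = moments(G, C, B, B, R)              # moments of the full system
N = moments(G22, C22, C21, C21, R)      # moments of the inner subsystem

G11_inv = np.linalg.inv(G11)
B1_invT = np.linalg.inv(B1).T

M_rec = [B1.T @ G11_inv @ B1]                           # equation (40)
M_rec.append(-B1.T @ G11_inv @ C11 @ G11_inv @ B1)      # equation (41)
for r in range(2, R):
    acc = -C11 @ B1_invT @ M_rec[r - 1]
    for l in range(r - 1):                              # l = 0, ..., r - 2
        acc = acc + N[l] @ B1_invT @ M_rec[r - l - 2]
    M_rec.append(B1.T @ G11_inv @ acc)                  # equation (42)

assert all(np.allclose(a, b) for a, b in zip(M, M_rec))
print("recursion reproduces the first", R, "moments")
```

In this check the full-system moments are rebuilt using only the outer blocks and the moments $N_l$, which is the dependence the argument above relies on.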

The developed relation between the moments of the original
system and the moments of its inner subsystem plays a
fundamental role in the proposed method. It allows us to match
moments recursively, two at a time, by iterative application of
the same transformation to subsystems of decreasing size. The
proposed proof is also applicable to the ROMs obtained from
other techniques such as SparseRC. There are two main differences
between our proof and the one given for SparseRC. First, the
SparseRC proof considers only the first two moments, while
ours is general. Second, the SparseRC proof addresses the
moments of the network admittance, whereas ours is
based on the original impedance representation of the network.
Our contribution therefore establishes the equivalence, from
a moment-matching perspective, of fast MOR methods (proposed,
SparseRC) and PRIMA.